This repository allows readers to reproduce the findings from the paper 
"Survey Professionalism: New Evidence from Web Browsing Data".

# Setting Up the Computational Environment

The reproduction can be run in two ways: by building a Docker image, which 
reproduces both the system-level components on which the analysis was based and 
the state of the R libraries and their versions (restored via `renv`); or simply 
by running `renv` natively, which restores the R libraries only.

## Build & run with Docker

Prerequisite: Docker Desktop ≥ 4.x (Windows/macOS) or Docker Engine ≥ 20.x (Linux).
Make sure Docker Desktop is open and reports that it is running 
(the whale icon has stopped spinning) before you execute the commands below.

```bash
# from the repository root
docker build -t surveyprofessionals .
docker run --rm -v "$PWD":/survey_professionals surveyprofessionals
```
On Windows, PowerShell needs `${PWD}` instead of `$PWD`.

`docker build` creates an image, running `renv::restore()` 
so every package in `renv.lock` is baked in.
`docker run` launches that image and immediately executes `run.R`.
Outputs are written back to your current host folder, which is mounted via `-v "$PWD":/survey_professionals`.

`surveyprofessionals` is simply the tag we give the image. 
Replace it with any name you like, as long as you use the same tag 
in both the `docker build` and `docker run` commands.
Likewise, `/survey_professionals` is just the folder inside the container 
where your project is mounted; its name has no relation to whatever you call 
the project directory on your own computer.
You can choose another internal path, but it must be identical to the folder defined 
in the Dockerfile (`WORKDIR /survey_professionals`), so change both together.

## Run natively with renv

```R
# one-time setup
install.packages("renv")      # skip if already installed
renv::restore()               # installs the exact package versions

# run the analysis
source("run.R")
```
This will install all required packages automatically in a local environment, 
without affecting your global R setup.
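
To confirm that the restore succeeded, one option is to compare the project library against the lockfile. The following is a minimal sketch, assuming it is run from the repository root; the variable `lock_present` is introduced here for illustration only.

```R
# Sketch: check that the project library matches renv.lock.
# Assumes the working directory is the repository root.
lock_present <- file.exists("renv.lock")

if (lock_present && requireNamespace("renv", quietly = TRUE)) {
  # Reports packages that are missing or differ from the lockfile
  renv::status()
} else {
  message("renv.lock or the renv package was not found; run the setup steps first.")
}
```

If `renv::status()` reports differences, running `renv::restore()` again should bring the library back in line with `renv.lock`.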

## Runtime

A full reproduction took ≈ 10 minutes on an Apple M4 (12-core CPU, 128 GB unified memory).
On a typical recent laptop with 4–8 CPU cores and 16–32 GB of RAM, 
the same pipeline will usually complete in ≈ 20–40 minutes. 

# Repository Structure

## `run.R`

This is the main script that runs the full analysis. 
It sources the scripts in the `code/` directory in the correct order 
and exports results and figures used in the study to `output/`. 
This script runs automatically when the Docker image is run.
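
After a run, a quick check that results were written can be sketched as follows. The file names are a small sample taken from the output lists in this README; the variable names `expected_outputs` and `missing_outputs` are illustrative and not part of the repository's code.

```R
# Sketch: verify that a few of the documented outputs exist after run.R finishes.
# Assumes the working directory is the repository root.
expected_outputs <- c(
  "output/fig1_rq1_agg_percent_survey_visits.pdf",
  "output/tab1_rq2_comparison_sociodems.html",
  "output/tab3_rq4_repeated_participation.html"
)

# Keep only the paths that are not present on disk
missing_outputs <- expected_outputs[!file.exists(expected_outputs)]

if (length(missing_outputs) == 0) {
  message("All checked outputs are present.")
} else {
  message("Missing outputs:\n", paste(missing_outputs, collapse = "\n"))
}
```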

## `code/`

Contains all the modular R scripts for data processing, modeling, and visualization. 
Running these scripts individually may not work well, and it is advised to run `run.R` instead.
All files rely on custom functions, styles, and constants defined in `utils/`.

### `0_raw_to_processed/`

This folder contains the code applied to the raw browsing data. 
We cannot publish the raw data for privacy reasons, 
but provide the scripts that process them for reference. 
Note that the Lucid (LU) and Facebook (FB) raw data was processed 
through an SQL query engine, while the YouGov (YG) data was processed in R. 

- `00-create-regex-bevec-vehovar.R`: creates regular expressions to match Bevec & Vehovar (2021) questionnaire sites, saved as `data/browsing_hosts/bevec_url_matches_patterns.txt`.
- `01-identify-survey-hosts-[FB/LU/YG].[R/sql]`: identifies the URL hosts of survey sites according to our three approaches. For reference, we save some of the resulting lists of hosts under `data/browsing_hosts/`.
- `02-summarize-survey-visits-[FB/LU/YG].[R/sql]`: summarizes number and duration of visits to survey sites by person. Results are saved as `data/browsing_summarized/people_visits_[FB/YG/LU].csv` and are the basis for further analysis.
- `03-identify-repeated-participation-[FB/LU/YG].[R/sql]`: identifies repeated visits to the same questionnaire URL. Results are still raw data and are not contained in the repository.
- `04-summarize-repeated-participation-[FB/LU/YG].R`: summarizes repeated visits to the same questionnaire URL. Results are saved as `data/browsing_summarized/people_repeated_[FB/YG/LU].csv` and are the basis for further analysis.

### `1-define-survey-profs.R`

Implements four different binary measures of professionalism. Input:

- data/browsing_summarized/people_visits_FB.csv
- data/browsing_summarized/people_visits_LU.csv
- data/browsing_summarized/people_visits_YG.csv

Output:

- data/browsing_summarized/people_prof_FB.csv
- data/browsing_summarized/people_prof_LU.csv
- data/browsing_summarized/people_prof_YG.csv

### `2-recode-weight-surveys.R`

Recodes survey variables needed for analysis and creates weights. Input:

- data/surveys_raw/facebook/fb_people_table.csv
- data/surveys_raw/lucid/us_missing_w0_sociodems.csv
- data/surveys_raw/lucid/US_survey_w0_lucid_raw.csv
- data/surveys_raw/lucid/US_survey_w0_qualtrics_raw.csv
- data/surveys_raw/lucid/US_survey_w1_raw.csv
- data/surveys_raw/lucid/US_survey_w2_raw.csv
- data/surveys_raw/lucid/US_survey_w3_raw.csv
- data/browsing_summarized/people_visits_FB.csv
- data/browsing_summarized/people_visits_YG_anon.csv
- data/surveys_raw/yougov/yg_survey_with_ids_anon_reduced.rds
- data/surveys_raw/yougov/NYUU0010_w1w2_OUTPUT_anon_reduced.sav

Output:

- data/surveys_processed/survey_FB_all.csv
- data/surveys_processed/survey_FB_donors.csv
- data/surveys_processed/survey_LU_all.csv
- data/surveys_processed/survey_LU_donors.csv
- data/surveys_processed/survey_YG_all.csv
- data/surveys_processed/survey_YG_donors.csv

### `3-create-analysis-data.R`

Merges surveys and summarized browsing data into analysis data sets.
For Facebook and Lucid, analysis data sets are in long format (one row per person-wave); 
for YouGov, in wide format (one row per person). Input:

- data/browsing_summarized/people_prof_FB.csv
- data/browsing_summarized/people_prof_LU.csv
- data/browsing_summarized/people_prof_YG.csv
- data/browsing_summarized/people_repeated_FB.csv
- data/browsing_summarized/people_repeated_LU.csv
- data/browsing_summarized/people_repeated_YG.csv
- data/browsing_summarized/people_visits_FB.csv
- data/browsing_summarized/people_visits_LU.csv
- data/browsing_summarized/people_visits_YG_anon.csv
- data/surveys_processed/survey_FB_donors.csv
- data/surveys_processed/survey_LU_donors.csv
- data/surveys_processed/survey_YG_donors.csv

Output:

- data/analysis_FB.csv
- data/analysis_LU.csv
- data/analysis_YG.csv

The latter three data sets are used in all remaining scripts.
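
For readers who want to inspect these files directly, the following minimal sketch loads whichever of them are present. It assumes the repository root as working directory and uses base R's `read.csv`; the repository's own scripts may read the files differently, and the names `analysis_files`, `available`, and `analysis` are illustrative.

```R
# Sketch: load the analysis data sets that exist on disk.
analysis_files <- c(
  FB = "data/analysis_FB.csv",
  LU = "data/analysis_LU.csv",
  YG = "data/analysis_YG.csv"
)

# Skip files that are not present so the snippet runs from any directory
available <- analysis_files[file.exists(analysis_files)]
analysis  <- lapply(available, read.csv)

# Rows per data set: person-waves for FB and LU, persons for YG
vapply(analysis, nrow, integer(1))
```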

### `4-describe-data.R`

Produces descriptive statistics of the data, as reported in the methods section and elsewhere. 
Text descriptives are printed to the console when running the script. Other output:

- output/tabB1_donors_vs_nondonors.html
- output/tabB3_survey_variables.html

### `RQ1-professionalism-prevalence.R`

Summarizes and plots the prevalence of survey professionalism. Output:

- output/fig1_rq1_agg_percent_survey_visits.pdf
- output/fig2_rq1_ind_percent_survey_visits_dist.pdf
- output/fig3_rq1_percent_survey_profs.pdf
- output/figC2_rq1_agg_percent_survey_visits_unweighted.pdf
- output/figC3_rq1_ind_percent_survey_visits_dist_unweighted.pdf
- output/figC4_rq1_percent_survey_profs_unweighted.pdf
- output/figC5_rq1_percent_survey_profs_device.pdf
- output/figC6_rq1_ind_percent_survey_visits_dist_device.pdf
- output/figC7_rq1_percent_survey_profs_device.pdf
- output/figC8_rq1_agg_percent_survey_visits_approaches.pdf
- output/figC9_rq1_survey_visits_top_hosts.pdf
- output/figC10_rq1_ind_percent_survey_visits.pdf
- output/figC11_rq1_agg_percent_survey_duration.pdf
- output/figC12_rq1_ind_percent_survey_duration_dist.pdf
- output/figC13_rq1_ind_percent_survey_duration.pdf
- output/figC14_rq1_sensitivity_attrition.pdf
- output/figC15_rq1_sensitivity_bounds.pdf

### `RQ2-sociodemographics.R`

Compares professionals and non-professionals on sociodemographics and political outcomes. Output:

- output/tab1_rq2_comparison_sociodems.html
- output/tabD4_rq2_comparison_sociodems_2.html
- output/tabD5_rq2_comparison_sociodems_3.html
- output/tabD6_rq2_comparison_sociodems_any.html

### `RQ3ab-speeding-straightlining.R`

Compares professionals and non-professionals on speeding and straightlining 
(and treatment effects). Output:

- output/tab2_rq2_comparison_quality.html
- output/tabE7_rq2_comparison_quality_2.html
- output/tabE8_rq2_comparison_quality_3.html
- output/tabE9_rq2_comparison_quality_any.html
- output/tabE10_rq3_treatment_effects_malvol.html
- output/tabE11_rq3_treatment_effects_perpol.html

### `RQ3c-response-instability.R` 

Runs models to compare professionals and non-professionals regarding between-wave 
response stability. The heteroscedastic Bayesian model relies on `RQ3c_heterocedastic_model.stan`
and will take some time to run. If your computational power is limited, you 
may need to lower the settings `mc.cores = parallel::detectCores()` and 
`chains = 4`. Produces the following plots:

- output/Fig4_rq3_zscores_bayesian.pdf
- output/FigE16_rq3_differences_bayesian.pdf
- output/FigE17_rq3_zscores_absdiff.pdf
- output/FigE18_rq3_effects_absdiff.pdf
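
On a machine with limited resources, more conservative values for the two settings mentioned above could be derived as in the sketch below. This is an illustration under stated assumptions, not the script's own code: the names `detected`, `n_cores`, and `n_chains` are hypothetical, and how the script passes these values to the sampler is not shown here.

```R
# Sketch: choose conservative settings for machines with limited resources.
detected <- parallel::detectCores()
if (is.na(detected)) detected <- 1L   # detectCores() can return NA

n_cores  <- max(1L, detected - 1L)    # hypothetical choice: leave one core free
n_chains <- min(4L, n_cores)          # at most the script's 4 chains, never more than cores

# Would replace mc.cores = parallel::detectCores() in the script
options(mc.cores = n_cores)
```

Note that running fewer than four chains reduces the total number of posterior draws, so results may differ slightly from those reported.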

### `RQ4-repeated-participation.R`

Quantifies attempts at repeated participation and 
compares professionals and non-professionals. Output:

- output/tab3_rq4_repeated_participation.html
- output/tab4_rq4_repeated_participation_prof.html
- output/tabF12_rq4_repeated_participation_6h.html
- output/tabF13_rq4_repeated_participation_6h_prof.html
- output/tabF14_rq4_repeated_participation_24h.html
- output/tabF15_rq4_repeated_participation_24h_prof.html
- output/tabF16_rq4_repeated_participation_prof_2.html
- output/tabF17_rq4_repeated_participation_prof_3.html
- output/tabF18_rq4_repeated_participation_prof_any.html
- output/tabF19_rq4_repeated_participation_platforms.html
- output/tabF20_rq4_repeated_participation_groups_age.html
- output/tabF21_rq4_repeated_participation_groups_race.html
- output/tabF22_rq4_repeated_participation_groups_partisanship.html
- output/tabF23_rq4_break_patterns.html

## `data/`

Contains the three analysis data sets used for the main analyses. 
The following subfolders contain data at various processing stages:

### `browsing_hosts/`

Contains lists of survey sites (URL hosts) produced during 
the identification of survey taking, which are included for reference.

### `browsing_summarized/`

Contains browsing data aggregated to the person or platform level. 

### `surveys_raw/`

Contains raw survey data, data about the US population, and external data 
sources used in the Supplementary Material (SM).

### `surveys_processed/`

Contains processed survey data.

## `output/`

Contains the figures and tables reported in the main paper and SM, as listed above.
Tables are exported as HTML, but can be saved as LaTeX by adapting the 
individual scripts.
